The statistical literature discusses different types of Markov properties for chain graphs that
lead to four possible classes of chain graph Markov models. The different models are rather 
well understood when the observations are continuous and multivariate normal, and it is also 
known that one model class, referred to as models of LWF (Lauritzen-Wermuth-Frydenberg) or 
block concentration type, yields discrete models for categorical data that are smooth. This paper 
considers the structural properties of the discrete models based on the three alternative Markov 
properties. It is shown by example that two of the alternative Markov properties can lead to 
non-smooth models. The remaining model class, which can be viewed as a discrete version of 
multivariate regressions, is proven to comprise only smooth models. The proof employs a simple 
change of coordinates that also reveals that the model's likelihood function is unimodal if the 
chain components of the graph are complete sets. 

Keywords: algebraic statistics; categorical data; conditional independence; graphical model; 
Markov property; path diagram 

1. Introduction 

A graphical Markov model is a statistical model defined over a graph whose vertices
correspond to observed random variables. The missing edges of the graph are translated
into conditional independence restrictions that the model imposes on the joint distribution
of the variables [21]. Among the more complex graphical models are those based
on chain graphs. Chain graphs may have both directed and undirected edges under the
constraint that there do not exist any semi-directed cycles. The absence of semi-directed
cycles implies that the vertex set of a chain graph can be partitioned into so-called chain
components such that edges within a chain component are undirected whereas the edges
between two chain components are directed and point in the same direction.

The rules that govern how a graph is translated into conditional independence restrictions
are known as Markov properties. Four classes of Markov properties for chain graphs
have been discussed in the literature, and we classify them as:

Type I: the LWF or block concentration Markov property [12, 22];
Type II: the AMP (alternative Markov property) or concentration regression Markov
property [1];
Type III: a Markov property that is dual to the type II property;
Type IV: the multivariate regression Markov property [4, 27], which can also be viewed
as a special case of Markov properties for path diagrams [20, 25, 26].

The four types arise by combining two different interpretations of directed edges with 
two different interpretations of undirected edges (compare Section 2). 

The four classes of Gaussian (i.e., multivariate normal) chain graph models associated 
with the above Markov properties are rather well understood. In particular, they are 
known to be smooth (i.e., they are curved exponential families [17]). Discrete models 
for categorical data have been thoroughly explored under the Markov property of type 
I (LWF). The resulting models have log-linear structure [21], Section 4.6.1, which yields 
that the models are smooth exponential families. However, less is known about the other 
discrete models. We note that discrete type IV models are related to models employed in 
longitudinal data analysis; see, for example, [11], and that, despite being termed modified 
path diagram models, the models discussed in [14] are most closely related to type I 
models. 

This paper investigates smoothness properties of discrete models of type II, III and IV,
which are all algebraic exponential families in the sense of [9]. Studying smoothness is
important because the standard asymptotic distribution theory (e.g., normal distribution
limits for maximum likelihood estimators and $\chi^2$-limits for likelihood ratios) is valid in
smooth algebraic exponential families but may fail in non-smooth models [6]. Smoothness
of conditional independence models cannot be taken for granted as demonstrated by the
following example; see also [2], Example 7. The example concerns two discrete random
variables $X_1$ and $X_2$ that are independent marginally as well as conditionally given a third
binary variable $X_3$; in symbols, $X_1 \perp\!\!\!\perp X_2$ and $X_1 \perp\!\!\!\perp X_2 \mid X_3$. The corresponding
subset of the appropriate probability simplex is a union of two sets corresponding to
$X_1 \perp\!\!\!\perp (X_2, X_3)$ and $X_2 \perp\!\!\!\perp (X_1, X_3)$, respectively. The set defined by
$X_1 \perp\!\!\!\perp (X_2, X_3)$ is a smooth manifold and so is the set given by $X_2 \perp\!\!\!\perp (X_1, X_3)$.
Their union, however, fails to be smooth where the two components intersect. This
intersection corresponds to complete independence of $X_1$, $X_2$ and $X_3$. Details on how the
presence of singularities in this example affects the behavior of a likelihood ratio test can be
found in [9], Section 4.2 and [6], Example 2.7, where the Gaussian version of the problem is
treated. The Gaussian case is analogous since for a jointly multivariate normal random vector
$(X_1, X_2, X_3)$ it also holds that $X_1 \perp\!\!\!\perp X_2$ and $X_1 \perp\!\!\!\perp X_2 \mid X_3$ is equivalent to
$X_1 \perp\!\!\!\perp (X_2, X_3)$ or $X_2 \perp\!\!\!\perp (X_1, X_3)$.

The main result of this paper shows that discrete type IV models are smooth. Stated
in Corollary 10, this result follows from a linear change of conditional probability
coordinates that simplifies the conditional independence constraints in the model definition
(Theorem 8 in Section 3). Moreover, type IV models have unimodal likelihood functions
if the chain components of the underlying graph are complete sets, in which case the
models of type II and type IV coincide (Section 4). Finally, we show by example in
Sections 5 and 6 that the classes of type II and III include non-smooth models. The paper
concludes with the discussion in Section 7.

Figure 1. (a) Chain graph with chain components {1}, {2}, {3,4} and {5,6,7,8}; (b) a graph
that is not a chain graph.

2. Chain graphs and their Markov properties 
2.1. Chain graphs 

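Let $G = (V, E)$ be a graph with finite vertex set $V$ and edge set
$E \subseteq (V \times V) \setminus \{(v, v) \mid v \in V\}$. An edge $(v, w) \in E$ is directed if
$(w, v) \notin E$ and undirected if $(w, v) \in E$. We denote a directed edge $(v, w)$ by $v \to w$
and write $v - w$ if $(v, w)$ is undirected. If $(v, w) \in E$ then $v$ and $w$ are adjacent. If
$v \to w$ then $v$ is a parent of $w$, and if $v - w$ then $v$ is a neighbor of $w$. Let
$\mathrm{pa}_G(v)$ and $\mathrm{nb}_G(v)$ denote the sets of parents and neighbors of $v$, respectively.
For $\sigma \subseteq V$, let

$$\mathrm{pa}_G(\sigma) = \Bigl(\bigcup_{v \in \sigma} \mathrm{pa}_G(v)\Bigr) \setminus \sigma, \qquad
\mathrm{nb}_G(\sigma) = \Bigl(\bigcup_{v \in \sigma} \mathrm{nb}_G(v)\Bigr) \setminus \sigma,$$

and $\mathrm{Nb}_G(\sigma) = \mathrm{nb}_G(\sigma) \cup \sigma$.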

A path in $G$ is a sequence of distinct vertices $(v_0, \dots, v_k)$ such that $v_{i-1}$ and $v_i$
are adjacent for all $1 \le i \le k$. A path $(v_0, \dots, v_k)$ is a semi-directed cycle if
$(v_i, v_{i+1}) \in E$ for all $0 \le i \le k$ and at least one of the edges is directed as
$v_i \to v_{i+1}$. Here, $v_{k+1} \equiv v_0$. A chain graph is a graph without semi-directed cycles
(see Figure 1). Define two vertices $v_0$ and $v_k$ in a chain graph $G$ to be equivalent if there
exists a path $(v_0, \dots, v_k)$ such that $v_i - v_{i+1}$ in $G$ for all $0 \le i \le k - 1$. The
equivalence classes under this equivalence relation are the chain components of $G$. The chain
components $(\tau \mid \tau \in \mathcal{T})$ yield a partitioning of the vertex set

$$V = \bigcup_{\tau \in \mathcal{T}} \tau, \tag{2.1}$$

and the subgraph $G_\tau$ induced by each chain component $\tau$ is a connected undirected
graph. Moreover, the directed edges between two chain components $\tau_1$ and $\tau_2$ all have
the same direction, that is, if $(v, w) \in \tau_1 \times \tau_2$ and $(x, y) \in \tau_1 \times \tau_2$ are two
pairs of adjacent vertices, then either $v \to w$ and $x \to y$ or $w \to v$ and $y \to x$ in $G$. It
follows that we can define an acyclic digraph (DAG) $D = D(G)$ over the chain components:
$\mathcal{T}$ is the vertex set of $D$, and we draw an edge $\tau_1 \to \tau_2$ if and only if there exist
$v \in \tau_1$ and $w \in \tau_2$ with $v \to w$ in $G$.

Figure 2. DAG of chain components for the chain graph from Figure 1(a).

Example 1. Consider the chain graph G in Figure 1(a). It has four chain components, 
namely, {1}, {2}, {3,4} and {5,6, 7,8}. The DAG D(G) has these four chain components 
as nodes and is depicted in Figure 2. 
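
The passage from a chain graph to its chain components and to the DAG $D(G)$ can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the code does not check for semi-directed cycles, and the edge lists encode the smaller graph of Figure 3 as it can be read off from Example 3 (assumed edges 1 → 3, 2 − 3 and 3 − 4), because the edges of Figure 1(a) are not listed in the text.

```python
def chain_components(vertices, undirected):
    """Connected components of the undirected part of the graph."""
    neighbors = {v: set() for v in vertices}
    for v, w in undirected:
        neighbors[v].add(w)
        neighbors[w].add(v)
    components, seen = [], set()
    for v in sorted(vertices):
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:                      # depth-first search along undirected edges
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(neighbors[u] - comp)
        seen |= comp
        components.append(frozenset(comp))
    return components


def component_dag(components, directed):
    """Edges tau1 -> tau2 of D(G): some v in tau1 and w in tau2 with v -> w."""
    owner = {v: c for c in components for v in c}
    return {(owner[v], owner[w]) for v, w in directed}


# Assumed encoding of the graph in Figure 3: 1 -> 3, 2 - 3, 3 - 4.
vertices = {1, 2, 3, 4}
undirected = [(2, 3), (3, 4)]
directed = [(1, 3)]

comps = chain_components(vertices, undirected)
dag = component_dag(comps, directed)
print(comps)   # chain components {1} and {2, 3, 4}
print(dag)     # single edge {1} -> {2, 3, 4}
```

Representing chain components as frozensets lets them serve directly as the nodes of $D(G)$.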

2.2. Block-recursive Markov properties 

A Markov property of a graph $G = (V, E)$ lists conditional independence statements

$$\alpha_i \perp\!\!\!\perp \beta_i \mid \gamma_i, \qquad i = 1, \dots, k,$$

for triples $(\alpha_i, \beta_i, \gamma_i)$ of pairwise disjoint subsets of $V$ with
$\alpha_i, \beta_i \neq \emptyset$. These triples are determined by the edge set $E$. The joint
distribution $P$ of a random vector $X \in \mathbb{R}^V$ obeys the Markov property if for all
$1 \le i \le k$, the subvector $X_{\alpha_i}$ is conditionally independent of $X_{\beta_i}$ given
$X_{\gamma_i}$. If $\gamma_i = \emptyset$ then the conditional independence is understood as marginal
independence of $X_{\alpha_i}$ and $X_{\beta_i}$.

Block-recursive Markov properties for a chain graph $G$ employ the recursive structure
of the chain components captured in the DAG $D = D(G)$; see [1, 21]. For $\tau \in \mathcal{T}$ let
$\mathrm{pa}_D(\tau)$ be the union of all $\bar\tau \in \mathcal{T} \setminus \{\tau\}$ that are parents of
$\tau$ in $D$. Similarly, the set of non-descendants $\mathrm{nd}_D(\tau)$ is the union of all
$\bar\tau \in \mathcal{T} \setminus \{\tau\}$ for which there is no directed path
$\tau \to \cdots \to \bar\tau$ in $D$. The following conditional independence statements are
associated with the DAG $D$:

$$\tau \perp\!\!\!\perp [\mathrm{nd}_D(\tau) \setminus \mathrm{pa}_D(\tau)] \mid \mathrm{pa}_D(\tau) \qquad \forall \tau \in \mathcal{T}. \tag{C1}$$

If the joint distribution has a density with respect to a product measure, then the
conditional independence relations in (C1) are equivalent to the density factorizing over the
graph [21]. We will employ the factorization over the DAG $D$ in our study of discrete
chain graph models; see (3.2).

Example 1 (Cont.). If we consider the chain component $\tau = \{1\}$ of the chain graph $G$
from Figure 1(a), then $\mathrm{pa}_D(\tau) = \emptyset$ and $\mathrm{nd}_D(\tau) = \{2\}$. Hence, (C1) states

$$\{1\} \perp\!\!\!\perp \{2\}.$$

If $\tau = \{5,6,7,8\}$, then $\mathrm{pa}_D(\tau) = \{3,4\}$ and
$\mathrm{nd}_D(\tau) = \{1\} \cup \{2\} \cup \{3,4\}$, which leads to

$$\{5,6,7,8\} \perp\!\!\!\perp \{1,2\} \mid \{3,4\}.$$

The factorization of a joint density $f(x_1, \dots, x_8)$ alluded to above takes the form

$$f(x_1, \dots, x_8) = f(x_1)\, f(x_2)\, f(x_3, x_4 \mid x_1, x_2)\, f(x_5, x_6, x_7, x_8 \mid x_3, x_4).$$

Each chain component $\tau \in \mathcal{T}$ induces an undirected subgraph $G_\tau$. Applying the
local version of the classic Markov property for these undirected graphs (see, e.g., [21]) to
the conditional distribution of the variables in $\tau$ given the variables in $\mathrm{pa}_D(\tau)$
leads to the conditional independence statements

$$\sigma \perp\!\!\!\perp [\tau \setminus \mathrm{Nb}_G(\sigma)] \mid [\mathrm{pa}_D(\tau) \cup \mathrm{nb}_G(\sigma)] \qquad \forall \tau \in \mathcal{T},\ \forall \sigma \subseteq \tau. \tag{C2a}$$

If the conditional distributions for $\tau$ given $\mathrm{pa}_D(\tau)$ have positive densities with
respect to a product measure, then the Hammersley-Clifford theorem implies that the conditional
independence relations in (C2a) correspond to factorizations of the conditional densities;
see again [21] for the precise results. As an alternative to (C2a), we can employ a dual
Markov property for undirected graphs (discussed, e.g., in [18]) that yields

$$\sigma \perp\!\!\!\perp [\tau \setminus \mathrm{Nb}_G(\sigma)] \mid \mathrm{pa}_D(\tau) \qquad \forall \tau \in \mathcal{T},\ \forall \sigma \subseteq \tau. \tag{C2b}$$

Both (C2a) and (C2b) describe the consequences of the absence of undirected edges
within a chain component $\tau$, but contrary to (C2a), the conditional independence
relations in (C2b) are generally not related to density factorizations.

Example 1 (Cont.). Let $\tau = \{5,6,7,8\}$ be the largest chain component of the graph $G$
in Figure 1(a), for which $\mathrm{pa}_D(\tau) = \{3,4\}$. If $\sigma = \{5,7\}$, then
$\mathrm{nb}_G(\sigma) = \{6\}$ and $\mathrm{Nb}_G(\sigma) = \{5,6,7\}$. Therefore, (C2a) states that

$$\{5,7\} \perp\!\!\!\perp \{8\} \mid \{3,4,6\},$$

whereas (C2b) states that

$$\{5,7\} \perp\!\!\!\perp \{8\} \mid \{3,4\}.$$

The final ingredient to the block-recursive Markov properties describes finer dependence
structures associated with the absence of directed edges. Again there are two
versions, namely,

$$\sigma \perp\!\!\!\perp [\mathrm{pa}_D(\tau) \setminus \mathrm{pa}_G(\sigma)] \mid [\mathrm{pa}_G(\sigma) \cup \mathrm{nb}_G(\sigma)] \qquad \forall \tau \in \mathcal{T},\ \forall \sigma \subseteq \tau \tag{C3a}$$

and

$$\sigma \perp\!\!\!\perp [\mathrm{pa}_D(\tau) \setminus \mathrm{pa}_G(\sigma)] \mid \mathrm{pa}_G(\sigma) \qquad \forall \tau \in \mathcal{T},\ \forall \sigma \subseteq \tau. \tag{C3b}$$

The two versions differ by whether vertices from the considered chain component $\tau$ are
included in the conditioning set or not.

Figure 3. Chain graph with chain components {1} and {2,3,4}.



Example 1 (Cont.). Consider again the graph $G$ in Figure 1(a) and the chain
component $\tau = \{5,6,7,8\}$ with $\mathrm{pa}_D(\tau) = \{3,4\}$. If $\sigma = \{5,7\}$, then
$\mathrm{pa}_G(\sigma) = \{3\}$ and $\mathrm{nb}_G(\sigma) = \{6\}$. Therefore, (C3a) states that

$$\{5,7\} \perp\!\!\!\perp \{4\} \mid \{3,6\},$$

whereas (C3b) states that

$$\{5,7\} \perp\!\!\!\perp \{4\} \mid \{3\}.$$

We are now ready to formally define the four types of Markov properties mentioned
in the Introduction.

Definition 2. Let $G$ be a chain graph with chain components $(\tau \mid \tau \in \mathcal{T})$. A
block-recursive Markov property for $G$ states (C1), one choice of either (C2a) or (C2b), and
one choice of either (C3a) or (C3b). The block-recursive Markov property is of

type I if it states (C1), (C2a) and (C3a);
type II if it states (C1), (C2a) and (C3b);
type III if it states (C1), (C2b) and (C3a);
type IV if it states (C1), (C2b) and (C3b).

As can be seen also in the next example, the Markov property of type I is the 'most
conditional' with the largest conditioning sets, whereas type IV is the 'most marginal'
with the smallest conditioning sets. Types II and III mix marginal and conditional
perspectives.

Example 3. Let $G$ be the chain graph in Figure 3. Then (C1) is a void statement. The
remaining statements can be summarized as follows:

type I: $2 \perp\!\!\!\perp 4 \mid \{1,3\}$ and $1 \perp\!\!\!\perp \{2,4\} \mid 3$,
type II: $2 \perp\!\!\!\perp 4 \mid \{1,3\}$ and $1 \perp\!\!\!\perp \{2,4\}$,
type III: $2 \perp\!\!\!\perp 4 \mid 1$ and $1 \perp\!\!\!\perp \{2,4\} \mid 3$,
type IV: $2 \perp\!\!\!\perp 4 \mid 1$ and $1 \perp\!\!\!\perp \{2,4\}$.
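
The statements of Example 3 can be generated mechanically from the definitions of (C2a), (C2b), (C3a) and (C3b). The following Python sketch does so for the chain component {2,3,4} of the graph in Figure 3. The edge encoding (1 → 3, 2 − 3, 3 − 4) and all helper names are assumptions made for this illustration.

```python
def union_of(graph_map, sigma):
    """Union of graph_map[v] over v in sigma, with sigma itself removed."""
    out = set()
    for v in sigma:
        out |= graph_map.get(v, set())
    return out - set(sigma)


# Assumed encoding of the graph in Figure 3: 1 -> 3, 2 - 3, 3 - 4.
parents = {3: {1}}                        # pa_G(v)
neighbors = {2: {3}, 3: {2, 4}, 4: {3}}   # nb_G(v)
tau = {2, 3, 4}                           # chain component under study
pa_D_tau = {1}                            # parents of tau in D(G)


def statements(sigma):
    """Triples (A, B, C), read as 'A independent of B given C'."""
    sigma = set(sigma)
    pa = union_of(parents, sigma)          # pa_G(sigma)
    nb = union_of(neighbors, sigma)        # nb_G(sigma)
    rest = tau - (nb | sigma)              # tau \ Nb_G(sigma)
    return {
        "C2a": (sigma, rest, pa_D_tau | nb),
        "C2b": (sigma, rest, set(pa_D_tau)),
        "C3a": (sigma, pa_D_tau - pa, pa | nb),
        "C3b": (sigma, pa_D_tau - pa, pa),
    }


s2 = statements({2})
s24 = statements({2, 4})
print(s2["C2a"], s2["C2b"])     # 2 indep. 4 given {1,3}; 2 indep. 4 given {1}
print(s24["C3a"], s24["C3b"])   # {2,4} indep. 1 given {3}; {2,4} indep. 1
```

A triple whose second entry is empty is a void statement and can be discarded.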
In Section 3 we will consider discrete chain graph models of type IV. When studying
these models we will exploit the following useful simplification of (C2b) that is due to
[25], Theorem 4. This simplification is based on connected sets, which are subsets of the
vertex set that induce a connected subgraph.

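Lemma 4. A probability distribution obeys the conditional independence relations (C2b)
if and only if it obeys

$$\sigma \perp\!\!\!\perp [\tau \setminus \mathrm{Nb}_G(\sigma)] \mid \mathrm{pa}_D(\tau) \qquad \forall \tau \in \mathcal{T},\ \forall \sigma \subseteq \tau,\ \sigma \text{ connected}. \tag{C2b-conn}$$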

Remark 5. In a multivariate normal distribution, two pairwise marginal independences
$v \perp\!\!\!\perp w$ and $v \perp\!\!\!\perp u$ imply that $v \perp\!\!\!\perp \{u, w\}$. As a consequence, Gaussian
chain graph models can be discussed in terms of pairwise Markov properties that list
conditional independences between pairs of non-adjacent vertices in the graph; compare, for
example, [7, 27]. When considering discrete random vectors taking only finitely many values,
the pairwise Markov property for type I models is still equivalent to the respective
block-recursive property as long as one limits oneself to positive joint distributions [12],
Theorem 3.3. However, for the models of types II/III/IV a positive discrete distribution that
obeys the pairwise Markov property will generally not obey the block-recursive Markov property.
This follows from the fact that in almost every positive joint distribution that exhibits
$v \perp\!\!\!\perp w$ and $v \perp\!\!\!\perp u$, the variable $v$ is not independent of the pair $\{u, w\}$
jointly. For instance, if $G$ is the graph in Figure 3, then the pairwise model of type II (AMP)
would be based on $2 \perp\!\!\!\perp 4 \mid \{1,3\}$, $1 \perp\!\!\!\perp 2$ and $1 \perp\!\!\!\perp 4$. Similarly, the
pairwise model of type IV (multivariate regression) would be based on $2 \perp\!\!\!\perp 4 \mid 1$,
$1 \perp\!\!\!\perp 2$ and $1 \perp\!\!\!\perp 4$. Such pairwise models will generally be of considerably
larger dimension than their block-recursive analogs.
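
The gap between pairwise and joint independence for discrete variables can be seen in a small example. The sketch below uses a hypothetical positive distribution for three binary variables $(v, w, u)$, obtained by mixing the uniform distribution with the distribution in which $u$ is the XOR of two independent fair coins $v$ and $w$. The mixture and the helper names are assumptions chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Positive joint distribution of (v, w, u): an equal mixture of the uniform
# distribution and the "u = v XOR w" distribution with fair coins v and w.
joint = {
    (v, w, u): Fraction(3, 16) if u == v ^ w else Fraction(1, 16)
    for v, w, u in product((0, 1), repeat=3)
}


def marginal(keep):
    """Marginal table of the coordinates listed in keep."""
    out = {}
    for state, prob in joint.items():
        key = tuple(state[k] for k in keep)
        out[key] = out.get(key, 0) + prob
    return out


def independent(a, b):
    """Exact check whether coordinate blocks a and b are independent."""
    pa, pb, pab = marginal(a), marginal(b), marginal(a + b)
    return all(pab.get(x + y, 0) == pa[x] * pb[y] for x in pa for y in pb)


print(independent((0,), (1,)))     # v and w
print(independent((0,), (2,)))     # v and u
print(independent((0,), (1, 2)))   # v and the pair (w, u)
```

The first two checks succeed while the third fails, so both pairwise independences hold without joint independence.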

3. Discrete models of type IV 

Let $X = (X_v \mid v \in V)$ be a discrete random vector with component $X_v$ taking values in
$[d_v] = \{1, \dots, d_v\}$. Let $\mathcal{I} = \mathop{\times}_{v \in V} [d_v]$. For
$i = (i_v \mid v \in V) \in \mathcal{I}$, let

$$p(i) = P(X_v = i_v \text{ for all } v \in V). \tag{3.1}$$

The joint distribution of $X$ is determined by the probability vector

$$p = (p(i) \mid i \in \mathcal{I})$$

in the $|\mathcal{I}| - 1 = (\prod_{v \in V} d_v) - 1$ dimensional probability simplex $\Delta$. Let
$\Delta^\circ$ be the interior of the probability simplex, which corresponds to the regular
exponential family of positive distributions on $\mathcal{I}$.

Definition 6. The discrete chain graph model $P_{\mathrm{IV}}(G)$ associated with the chain graph
$G = (V, E)$ is the set of (positive) probability vectors in $\Delta^\circ$ that yield a distribution
on $\mathcal{I}$ that obeys the block-recursive Markov property of type IV (multivariate regression).
The higher-level structure of the model $P_{\mathrm{IV}}(G)$ is determined by condition (C1). For
a probability vector $p \in \Delta^\circ$, this condition is equivalent to the condition that

$$p(i) = \prod_{\tau \in \mathcal{T}} p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}), \tag{3.2}$$

where

$$p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}) = P(X_\tau = i_\tau \mid X_{\mathrm{pa}_D(\tau)} = i_{\mathrm{pa}_D(\tau)}).$$

The factorization in (3.2) is the usual factorization over a DAG, but applied to the DAG
of chain components $D = D(G)$. For a subset $\alpha \subseteq V$, define
$\mathcal{I}_\alpha = \mathop{\times}_{v \in \alpha} [d_v]$. For fixed
$i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$, the vector with components
$p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$, $i_\tau \in \mathcal{I}_\tau$, is a probability
vector in the interior of the $|\mathcal{I}_\tau| - 1$ dimensional probability simplex $\Delta_\tau^\circ$.

In [8], a linear change of coordinates is used to simplify the description of conditional 
independence constraints that correspond to (C2b). We will now show how to generalize 
this change of coordinates to a version involving conditional probabilities that simplifies 
both (C2b) and (C3b). 

Consider a subset $\emptyset \neq \sigma \subseteq \tau$ of a chain component $\tau \in \mathcal{T}$. Let
$i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$ be a given conditioning state. Define the
restricted state space

$$\mathcal{J}_\sigma = \mathop{\times}_{v \in \sigma} [d_v - 1].$$

The set $[d_v - 1] = \{1, \dots, d_v - 1\}$ is the state space for random variable $X_v$ but with the
highest-numbered state $d_v$ removed. (Any other state could be removed instead.) With
each state $j_\sigma \in \mathcal{J}_\sigma$ we associate the conditional probability

$$q(j_\sigma \mid i_{\mathrm{pa}_D(\tau)}) = P(X_\sigma = j_\sigma \mid X_{\mathrm{pa}_D(\tau)} = i_{\mathrm{pa}_D(\tau)}), \tag{3.3}$$

which we call the saturated Möbius parameter for $\sigma$ and $j_\sigma$ given
$(\tau, i_{\mathrm{pa}_D(\tau)})$. For fixed chain component $\tau$ and conditional state
$i_{\mathrm{pa}_D(\tau)}$, but $\sigma$ varying through the power set of $\tau$ and $j_\sigma$ varying
through $\mathcal{J}_\sigma$, there are $|\mathcal{I}_\tau| - 1$ many saturated Möbius parameters. They can
be computed from the conditional probabilities $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$ by the obvious
summations. These summations define a linear map $\mu : \Delta_\tau \to \mathbb{R}^{|\mathcal{I}_\tau| - 1}$
taking the conditional probabilities $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$, $i_\tau \in \mathcal{I}_\tau$,
to the saturated Möbius parameters given $(\tau, i_{\mathrm{pa}_D(\tau)})$. We note
that these parameters are closely related to conditional versions of dependence ratios;
see [10] and references therein.

The following fact corresponds to Proposition 6 in [8], which is based on the well-known
Möbius inversion; see also [16] where the Kronecker product structure of the matrices for
the linear maps $\mu$ and $\mu^{-1}$ is described.

Lemma 7. The linear map $\mu : \Delta_\tau \to \mathbb{R}^{|\mathcal{I}_\tau| - 1}$ from conditional
probabilities to saturated Möbius parameters is bijective with the inverse map determined as
follows: Let $i_\tau \in \mathcal{I}_\tau$ and define

$$\sigma = \sigma(i_\tau) := \{v \in \tau \mid i_v \in [d_v - 1]\} \subseteq \tau.$$






M. Drton 



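Let $q(j_\emptyset \mid i_{\mathrm{pa}_D(\tau)}) = 1$. Then

$$p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}) = \sum_{\alpha:\, \sigma \subseteq \alpha \subseteq \tau} (-1)^{|\alpha \setminus \sigma|} \sum_{j_{\alpha \setminus \sigma} \in \mathcal{J}_{\alpha \setminus \sigma}} q(j_{\alpha \setminus \sigma}, i_\sigma \mid i_{\mathrm{pa}_D(\tau)}).$$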
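
The map to saturated Möbius parameters and its inverse are easy to try out numerically. The following Python sketch treats a single chain component with a fixed conditioning state, so that the conditional probabilities form one probability table. States are numbered from 0, the last state of each variable plays the role of the removed state $d_v$, and the inverse is computed by inclusion-exclusion over supersets of $\sigma(i_\tau)$. The function names and the example table are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def subsets(items):
    """All subsets of items, as sorted tuples."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def to_moebius(p, dims):
    """q[(sigma, j)] = P(X_sigma = j), with the last state of each variable excluded."""
    q = {}
    for sigma in subsets(range(len(dims))):
        for j in product(*(range(dims[v] - 1) for v in sigma)):
            q[(sigma, j)] = sum(
                prob for state, prob in p.items()
                if all(state[v] == jv for v, jv in zip(sigma, j))
            )
    return q  # q[((), ())] == 1 encodes the convention for the empty set


def from_moebius(q, dims):
    """Recover the probability table by inclusion-exclusion."""
    n, p = len(dims), {}
    for i in product(*(range(d) for d in dims)):
        sigma = tuple(v for v in range(n) if i[v] < dims[v] - 1)
        rest = tuple(v for v in range(n) if v not in sigma)
        total = 0
        for extra in subsets(rest):                      # alpha = sigma + extra
            alpha = tuple(sorted(sigma + extra))
            for j_extra in product(*(range(dims[v] - 1) for v in extra)):
                state = dict(zip(extra, j_extra))
                state.update((v, i[v]) for v in sigma)
                key = (alpha, tuple(state[v] for v in alpha))
                total += (-1) ** len(extra) * q[key]
        p[i] = total
    return p


# Hypothetical positive table for three variables with 2, 3 and 2 states.
dims = (2, 3, 2)
cells = list(product(*(range(d) for d in dims)))
norm = sum(range(1, len(cells) + 1))
p = {i: Fraction(k + 1, norm) for k, i in enumerate(cells)}

q = to_moebius(p, dims)
print(len(q) - 1)                    # number of saturated Moebius parameters
print(from_moebius(q, dims) == p)    # round trip recovers the table exactly
```

Exact rational arithmetic is used so that the round trip can be compared with `==` rather than a tolerance.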

As we show next, both (C2b) and (C3b) take on a simple form when expressed in
terms of the saturated Möbius parameter coordinates. We use that every set $\delta \subseteq \tau$ that
is not connected in $G$ can be partitioned uniquely into inclusion-maximal connected sets
$\gamma_1, \dots, \gamma_r \subseteq \tau$,

$$\delta = \gamma_1 \cup \gamma_2 \cup \cdots \cup \gamma_r. \tag{3.4}$$



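Theorem 8. Let $G$ be a chain graph with chain components $(\tau \mid \tau \in \mathcal{T})$. A probability
vector $p \in \Delta^\circ$ belongs to the discrete chain graph model $P_{\mathrm{IV}}(G)$ if and only if the
following three conditions hold:

(i) The components of $p$ factor as in (3.2).

(ii) For all $\tau \in \mathcal{T}$ and $i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$, the saturated
Möbius parameters for $p$ satisfy that

$$q(j_\delta \mid i_{\mathrm{pa}_D(\tau)}) = q(j_{\gamma_1} \mid i_{\mathrm{pa}_D(\tau)})\, q(j_{\gamma_2} \mid i_{\mathrm{pa}_D(\tau)}) \cdots q(j_{\gamma_r} \mid i_{\mathrm{pa}_D(\tau)})$$

for every disconnected set $\delta \subseteq \tau$ and $j_\delta \in \mathcal{J}_\delta$. Here
$\gamma_1, \dots, \gamma_r \subseteq \tau$ are the inclusion-maximal connected sets in (3.4).

(iii) For all $\tau \in \mathcal{T}$, connected subsets $\gamma \subseteq \tau$ and $j_\gamma \in \mathcal{J}_\gamma$,
the saturated Möbius parameters for $p$ satisfy that

$$q(j_\gamma \mid i_{\mathrm{pa}_D(\tau)}) = q(j_\gamma \mid k_{\mathrm{pa}_D(\tau)})$$

for every pair $i_{\mathrm{pa}_D(\tau)}, k_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$ such that
$i_{\mathrm{pa}_G(\gamma)} = k_{\mathrm{pa}_G(\gamma)}$.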

Proof. Clearly, the factorization (3.2) required in (i) is equivalent to (C1). Applying
Theorem 8 in [8] to each of the conditional distributions associated with the different
vectors $i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$, we see that (ii) is equivalent to (C2b).

Condition (C3b) states that $\gamma \perp\!\!\!\perp \mathrm{pa}_D(\tau) \setminus \mathrm{pa}_G(\gamma) \mid \mathrm{pa}_G(\gamma)$
for all subsets $\gamma \subseteq \tau \in \mathcal{T}$ of the chain components. The conditional independence
for given $\gamma$ and $\tau$ holds if and only if

$$p(i_\gamma \mid i_{\mathrm{pa}_D(\tau)}) = p(i_\gamma \mid k_{\mathrm{pa}_D(\tau)}) \tag{3.5}$$

for all $i_\gamma \in \mathcal{I}_\gamma$ and every pair
$i_{\mathrm{pa}_D(\tau)}, k_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$ such that
$i_{\mathrm{pa}_G(\gamma)} = k_{\mathrm{pa}_G(\gamma)}$. If $i_\gamma \in \mathcal{J}_\gamma$, then

$$p(i_\gamma \mid i_{\mathrm{pa}_D(\tau)}) = q(i_\gamma \mid i_{\mathrm{pa}_D(\tau)}). \tag{3.6}$$

Hence, (C3b) implies (iii).

For the reverse implication we claim that (ii) and (iii) imply (C3b). Fix
$i_{\mathrm{pa}_D(\tau)}, k_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}$ such that
$i_{\mathrm{pa}_G(\gamma)} = k_{\mathrm{pa}_G(\gamma)}$. By Lemma 7 and (ii), we can express
$p(i_\gamma \mid i_{\mathrm{pa}_D(\tau)})$ as a function of saturated Möbius parameters
$q(j_\alpha \mid i_{\mathrm{pa}_D(\tau)})$, where $\alpha$ is a connected subset of $\gamma$ and
$j_\alpha \in \mathcal{J}_\alpha$. By (iii),
$q(j_\alpha \mid i_{\mathrm{pa}_D(\tau)}) = q(j_\alpha \mid k_{\mathrm{pa}_D(\tau)})$, and thus (3.5) holds. □

Example 9. In the chain graph in Figure 3, chain component $\{1\}$ is a singleton without
parents and the associated saturated Möbius parameters are $q(j_1)$, $j_1 \in [d_1 - 1]$. The
saturated Möbius parameters for the second chain component $\{2,3,4\}$ are the conditional
probabilities

$$q(j_2 \mid i_1), \quad q(j_3 \mid i_1), \quad q(j_4 \mid i_1),$$

$$q(j_2, j_3 \mid i_1), \quad q(j_3, j_4 \mid i_1), \quad q(j_2, j_4 \mid i_1), \quad q(j_2, j_3, j_4 \mid i_1),$$

where $i_1 \in [d_1]$, $j_2 \in [d_2 - 1]$, $j_3 \in [d_3 - 1]$ and $j_4 \in [d_4 - 1]$. The saturated Möbius
parameters correspond to a probability vector in $P_{\mathrm{IV}}(G)$ if and only if the following
equations hold for all $i_1, k_1 \in [d_1]$ and $j_v \in [d_v - 1]$, $v \ge 2$:

$$q(j_2, j_4 \mid i_1) = q(j_2 \mid i_1)\, q(j_4 \mid i_1),$$

$$q(j_2 \mid i_1) = q(j_2 \mid k_1),$$

$$q(j_4 \mid i_1) = q(j_4 \mid k_1).$$

The first equation is given by Theorem 8(ii), the others by Theorem 8(iii).

Theorem 8 can be read as expressing certain saturated Möbius parameters as functions
of the remaining ones. One obtains a parametrization of $P_{\mathrm{IV}}(G)$ in which the parameters
are the conditional probabilities

$$q_\gamma(j_\gamma \mid i_{\mathrm{pa}_G(\gamma)}) = P(X_\gamma = j_\gamma \mid X_{\mathrm{pa}_G(\gamma)} = i_{\mathrm{pa}_G(\gamma)}) \tag{3.7}$$

with $j_\gamma \in \mathcal{J}_\gamma$ and $i_{\mathrm{pa}_G(\gamma)} \in \mathcal{I}_{\mathrm{pa}_G(\gamma)}$. We call the
probabilities in (3.7) the Möbius parameters for model $P_{\mathrm{IV}}(G)$. For a chain component
$\tau \in \mathcal{T}$, let $\mathcal{C}(\tau)$ be the family of connected sets in the induced subgraph
$G_\tau$, and define the vector

$$q_\tau = \bigl(q_\gamma(j_\gamma \mid i_{\mathrm{pa}_G(\gamma)}) \mid \gamma \in \mathcal{C}(\tau),\ j_\gamma \in \mathcal{J}_\gamma,\ i_{\mathrm{pa}_G(\gamma)} \in \mathcal{I}_{\mathrm{pa}_G(\gamma)}\bigr).$$

Let $Q_\tau$ be the set of vectors $q_\tau$ that are obtained from some $p \in P_{\mathrm{IV}}(G)$. Moreover,
let $q = (q_\tau \mid \tau \in \mathcal{T})$ and define $Q_G$ to be the set of vectors $q$ obtained from some
$p \in P_{\mathrm{IV}}(G)$. The set

$$Q_G = \mathop{\times}_{\tau \in \mathcal{T}} Q_\tau \tag{3.8}$$

is a Cartesian product, that is, the Möbius parameters from different chain components
are variation-independent. Each set $Q_\tau$, however, is constrained via polynomial
inequalities and no additional factorization of $Q_\tau$ into a Cartesian product seems possible in
general.
The equations among saturated Möbius parameters that appear in Theorem 8 are of
particularly simple nature and reveal that the set $P_{\mathrm{IV}}(G)$ is a smooth manifold in the
interior of the probability simplex; compare, for example, [13], Theorem 1.

Corollary 10. For every chain graph $G$ the discrete chain graph model $P_{\mathrm{IV}}(G)$ is a
curved exponential family.

Corollary 10 implies that in discrete chain graph models of type IV, maximum likelihood
estimators are asymptotically normal, likelihood ratio statistics for model comparisons
have asymptotic $\chi^2$-distributions, and the Bayesian information criterion is
consistent for model selection [17]. For application of likelihood ratio tests and information
criteria, it is important to know the dimension of $P_{\mathrm{IV}}(G)$, which is readily obtained from
the chain graph $G$ and the numbers of states the involved random variables may take.

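Corollary 11. The dimension of the discrete chain graph model $P_{\mathrm{IV}}(G)$ is

$$\dim(P_{\mathrm{IV}}(G)) = \sum_{\tau \in \mathcal{T}} \sum_{C \in \mathcal{C}(\tau)} \Bigl(\prod_{v \in C} (d_v - 1)\Bigr) \prod_{w \in \mathrm{pa}_G(C)} d_w.$$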
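
The dimension formula can be evaluated by brute-force enumeration of the connected subsets of each chain component. The Python sketch below does this for four binary variables, first for the graph of Figure 3 (assumed edges 1 → 3, 2 − 3, 3 − 4) and then for the graph obtained by additionally joining 2 and 4, which makes the component {2,3,4} complete. Function names and edge encodings are assumptions made for this illustration.

```python
from itertools import combinations
from math import prod


def connected(subset, undirected):
    """True if subset induces a connected subgraph of the undirected edges."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in undirected:
            for x, y in ((a, b), (b, a)):
                if x == u and y in subset and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == subset


def dim_type_iv(components, undirected, directed, levels):
    """Sum over connected C of prod(d_v - 1, v in C) * prod(d_w, w in pa_G(C))."""
    total = 0
    for tau in components:
        for r in range(1, len(tau) + 1):
            for c in combinations(sorted(tau), r):
                if connected(c, undirected):
                    pa = {v for v, w in directed if w in c} - set(c)
                    total += prod(levels[v] - 1 for v in c) * prod(levels[w] for w in pa)
    return total


levels = {1: 2, 2: 2, 3: 2, 4: 2}          # four binary variables
components = [{1}, {2, 3, 4}]
directed = [(1, 3)]

dim_fig3 = dim_type_iv(components, [(2, 3), (3, 4)], directed, levels)
dim_complete = dim_type_iv(components, [(2, 3), (3, 4), (2, 4)], directed, levels)
print(dim_fig3)       # 11
print(dim_complete)   # 12
```

The value 11 agrees with a direct count: the type IV statements for Figure 3 amount to mutual independence of $X_1$, $X_2$ and $X_4$, which removes 4 of the 15 free parameters of a positive distribution on four binary variables.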

4. Likelihood inference in models of type IV 

Continuing our discussion of models based on the multivariate regression Markov property
(type IV), suppose we observe a sample $X^{(1)}, \dots, X^{(n)}$ of independent and identically
distributed discrete random vectors taking values in $\mathcal{I} = \mathop{\times}_{v \in V} [d_v]$.
Suppose further that the joint distribution common to the random vectors in the sample
corresponds to an unknown probability vector $p \in P_{\mathrm{IV}}(G)$, where $G = (V, E)$ is a chain
graph. If we define the counts

$$n(i) = \sum_{k=1}^{n} 1\{X^{(k)} = i\},$$

then the likelihood function of $P_{\mathrm{IV}}(G)$ is equal to

$$L(p) = \prod_{i \in \mathcal{I}} p(i)^{n(i)}. \tag{4.1}$$

This likelihood function admits a factorization that we express in (4.2) using the
log-likelihood function $\ell(p) = \log L(p)$.
For $\alpha \subseteq V$ and $i_\alpha \in \mathcal{I}_\alpha$, let

$$n(i_\alpha) = \sum_{j \in \mathcal{I}:\, j_\alpha = i_\alpha} n(j)$$

and

$$p(i_\alpha) = P(X_\alpha = i_\alpha) = \sum_{j \in \mathcal{I}:\, j_\alpha = i_\alpha} p(j).$$

By (3.2), the log-likelihood function can be written as

$$\ell(p) = \sum_{\tau \in \mathcal{T}} \sum_{i_\tau \in \mathcal{I}_\tau} \sum_{i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}} n(i_\tau, i_{\mathrm{pa}_D(\tau)}) \log p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}). \tag{4.2}$$

Since the Möbius parameter space $Q_G$ factors accordingly into a Cartesian product, see
(3.8), it follows that we can maximize $\ell(p)$ over $P_{\mathrm{IV}}(G)$ by maximizing the component
log-likelihood functions

$$\ell_\tau(p) = \sum_{i_\tau \in \mathcal{I}_\tau} \sum_{i_{\mathrm{pa}_D(\tau)} \in \mathcal{I}_{\mathrm{pa}_D(\tau)}} n(i_\tau, i_{\mathrm{pa}_D(\tau)}) \log p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}) \tag{4.3}$$

separately for each $\tau \in \mathcal{T}$. Combining the optima from these separate constrained
maximizations according to (3.2) yields a maximum $\hat p$ of the likelihood function over
$P_{\mathrm{IV}}(G)$. It is an open question whether or not the likelihood function of the model
$P_{\mathrm{IV}}(G)$ can be multimodal.
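
As a quick numerical illustration of the factorization of the log-likelihood into component terms, the following Python sketch evaluates $\log L(p)$ both directly and as the sum of the two component terms for the chain components {1} and {2,3,4} of the graph in Figure 3. The sample and the probability vector are hypothetical and serve only to exercise the formulas; states are numbered from 0.

```python
from collections import Counter
from itertools import product
from math import isclose, log

# Hypothetical sample of (x1, x2, x3, x4) for four binary variables.
sample = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1), (1, 0, 1, 1),
          (0, 0, 1, 0), (1, 1, 0, 1), (0, 1, 1, 1), (1, 0, 0, 0)]
n = Counter(sample)                              # cell counts n(i)

# Hypothetical positive probability vector p.
states = list(product((0, 1), repeat=4))
raw = {i: 1 + 0.1 * k for k, i in enumerate(states)}
total = sum(raw.values())
p = {i: w / total for i, w in raw.items()}


def marg(table, keep):
    """Marginal table (of counts or probabilities) for the coordinates in keep."""
    out = Counter()
    for i, val in table.items():
        out[tuple(i[k] for k in keep)] += val
    return out


# Direct evaluation: sum over observed cells of n(i) * log p(i).
loglik = sum(cnt * log(p[i]) for i, cnt in n.items())

# Component {1} has no parents; component {2,3,4} has pa_D = {1}.
p1, n1 = marg(p, (0,)), marg(n, (0,))
ell_1 = sum(cnt * log(p1[i]) for i, cnt in n1.items())
ell_234 = sum(cnt * log(p[i] / p1[i[:1]]) for i, cnt in n.items())

print(isclose(loglik, ell_1 + ell_234))
```

For this two-component graph the factorization (3.2) holds for every positive distribution, so the identity is checked here without imposing the remaining type IV constraints.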

In some situations, some of the components of a maximum likelihood estimate in
the model $P_{\mathrm{IV}}(G)$ may be empirical proportions. Recall that a set of vertices
$\alpha \subseteq V$ is complete if every pair of vertices in $\alpha$ is joined by an edge.

Proposition 12. (i) If the chain component $\tau$ of the chain graph $G$ is a complete set
and $\mathrm{pa}_D(\tau) = \emptyset$, then the maximum likelihood estimator of the marginal probability
$p(i_\tau)$ is the empirical proportion

$$\hat p(i_\tau) = \frac{n(i_\tau)}{n}.$$

(ii) If the chain component $\tau$ is a singleton, then the maximum likelihood estimator of
the conditional probability $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$ is the empirical proportion

$$\hat p(i_\tau \mid i_{\mathrm{pa}_D(\tau)}) = \frac{n(i_\tau, i_{\mathrm{pa}_D(\tau)})}{n(i_{\mathrm{pa}_D(\tau)})}.$$

Proof. Both observations are immediate consequences of the likelihood factorization 
provided by (4.2) and (3.8). □ 

The Möbius parameters of a model $P_{\mathrm{IV}}(G)$ generally satisfy nonlinear polynomial
inequalities. However, if the chain components of $G$ are complete, then linear structure
arises, which leads to the following fact:

Theorem 13. If the chain component $\tau$ of the chain graph $G$ is complete and all counts
$n(i_\tau, i_{\mathrm{pa}_D(\tau)})$ are positive then the component log-likelihood function $\ell_\tau$ in the
model $P_{\mathrm{IV}}(G)$ (see (4.3)) has a unique local and thus global maximum.

Proof. Under the assumed completeness of $\tau$, Theorem 8(ii) imposes no constraints on
the probabilities $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$. Since condition (iii) in Theorem 8 imposes only
linear constraints, each $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$ is a linear function of the Möbius parameters
$q_\gamma(j_\gamma \mid i_{\mathrm{pa}_G(\gamma)})$, $\gamma \in \mathcal{C}(\tau)$, defined in (3.7). If $f$ denotes the
injective linear map from these Möbius parameters to the probabilities
$p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$, then the log-likelihood function in terms of the
Möbius parameters is strictly concave because it is the composition of $f$ and the strictly
concave function $\ell_\tau$ in (4.3). Moreover, the domain of definition of this log-likelihood
function is the interior of a polyhedron, and in particular convex. In order to see this
note that the probabilities $p(i_\tau \mid i_{\mathrm{pa}_D(\tau)})$ lie in a Cartesian product of open
probability simplices. The preimage of this Cartesian product under the linear map $f$ is the
interior of a polyhedron and gives the desired domain of definition; see also [24], Section 1.2.1.
The claim now follows because a strictly concave function has at most one local maximum
over a convex set. □

We remark that Theorem 13 also applies to discrete chain graph models of type II
(AMP) because $P_{\mathrm{II}}(G) = P_{\mathrm{IV}}(G)$ if $G$ has complete chain components.



5. A non-smooth model of type II (AMP) 

Prior work on discrete models of type I (LWF) and our new results on type IV models 
establish that both these model classes comprise only smooth models. In this section we 
give an example that shows that the same does not hold for discrete chain graph models 
of type II (AMP). 

Let $G = (V, E)$ be the chain graph in Figure 3. The model $P_{\mathrm{II}}(G)$ contains the positive
distributions for which

$$2 \perp\!\!\!\perp 4 \mid \{1,3\} \quad\text{and}\quad 1 \perp\!\!\!\perp \{2,4\};$$

recall Example 3. In order to analyze $P_{\mathrm{II}}(G)$ we exploit that
$P_{\mathrm{II}}(G) \subseteq P_{\mathrm{IV}}(\bar G)$, where $\bar G$ is the chain graph obtained by adding the edge
$2 - 4$ to $G$. The model $P_{\mathrm{IV}}(\bar G)$ comprises the positive distributions satisfying
$1 \perp\!\!\!\perp \{2,4\}$ and, according to (3.7), this model can be parametrized using the marginal
probabilities

$$q_1(j_1), \quad q_2(j_2), \quad q_4(j_4), \quad q_{24}(j_2, j_4) \tag{5.1}$$

and the conditional probabilities

$$q_3(j_3 \mid i_1), \quad q_{23}(j_2, j_3 \mid i_1), \quad q_{34}(j_3, j_4 \mid i_1), \quad q_{234}(j_2, j_3, j_4 \mid i_1), \tag{5.2}$$

where $i_1 \in [d_1]$ and $j_k \in [d_k - 1]$ for $k = 2, 3, 4$. The conditional independence
$2 \perp\!\!\!\perp 4 \mid \{1,3\}$, however, is not exhibited by a generic distribution in
$P_{\mathrm{IV}}(\bar G)$ and leads to additional constraints.
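For $i_1 \in [d_1]$ and $i_3 \in [d_3 - 1]$, let $Q^{(i_1, i_3)}$ be the $d_2 \times d_4$-matrix that has entries

$$Q^{(i_1, i_3)}_{i_2 i_4} = \begin{cases}
q_{234}(i_2, i_3, i_4 \mid i_1), & \text{if } i_2 < d_2 \text{ and } i_4 < d_4, \\
q_{23}(i_2, i_3 \mid i_1), & \text{if } i_2 < d_2 \text{ and } i_4 = d_4, \\
q_{34}(i_3, i_4 \mid i_1), & \text{if } i_2 = d_2 \text{ and } i_4 < d_4, \\
q_3(i_3 \mid i_1), & \text{if } i_2 = d_2 \text{ and } i_4 = d_4.
\end{cases} \tag{5.3}$$

For $i_1 \in [d_1]$ and $i_3 = d_3$, we define $Q^{(i_1, d_3)}$ to be the matrix with entries

$$Q^{(i_1, d_3)}_{i_2 i_4} = \begin{cases}
q_{24}(i_2, i_4) - q_{234}(i_2, +, i_4 \mid i_1), & \text{if } i_2 < d_2 \text{ and } i_4 < d_4, \\
q_2(i_2) - q_{23}(i_2, + \mid i_1), & \text{if } i_2 < d_2 \text{ and } i_4 = d_4, \\
q_4(i_4) - q_{34}(+, i_4 \mid i_1), & \text{if } i_2 = d_2 \text{ and } i_4 < d_4, \\
1 - q_3(+ \mid i_1), & \text{if } i_2 = d_2 \text{ and } i_4 = d_4.
\end{cases} \tag{5.4}$$

In (5.4), the replacement of index $i_3$ by $+$ stands for summation over $i_3 \in [d_3 - 1]$ such
that, for example,

$$q_3(+ \mid i_1) = \sum_{i_3 = 1}^{d_3 - 1} q_3(i_3 \mid i_1).$$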

Proposition 14. Let $G$ be the graph in Figure 3, and $p \in P_{\mathrm{II}}(\bar G) = P_{\mathrm{IV}}(\bar G)$. Then,
$p \in P_{\mathrm{II}}(G)$ if and only if the Möbius parameters from (5.1) and (5.2) satisfy that for all
$i_1 \in [d_1]$ and $i_3 \in [d_3]$, the matrix $Q^{(i_1, i_3)}$ has a rank of one at most.

Proof. A joint distribution with probability vector $p$ satisfies $2 \perp\!\!\!\perp 4 \mid \{1,3\}$ if and
only if for all $i_1 \in [d_1]$ and $i_3 \in [d_3]$, the $d_2 \times d_4$-matrix

$$P^{(i_1, i_3)}_{2 \perp\!\!\!\perp 4 \mid \{1,3\}} = \begin{pmatrix}
p(i_1, 1, i_3, 1) & \cdots & p(i_1, 1, i_3, d_4) \\
\vdots & & \vdots \\
p(i_1, d_2, i_3, 1) & \cdots & p(i_1, d_2, i_3, d_4)
\end{pmatrix} \tag{5.5}$$

has a rank of one at most; see, for example, [24], Section 1.5. Using Lemma 7 and
Theorem 8, each matrix $P^{(i_1, i_3)}_{2 \perp\!\!\!\perp 4 \mid \{1,3\}}$ can be rewritten in terms of the Möbius
parameters in (5.1) and (5.2). This requires forming polynomial expressions in the Möbius
parameters, but because $\{2,3,4\}$ is a complete set in $\bar G$, these expressions are equal to the
product of a linear term and a marginal probability $q_1(i_1) = P(X_1 = i_1)$. Since $p$ is positive,
we can cancel out the marginal probability arriving at a matrix filled with linear expressions;
this is equivalent to conditioning on variable $X_1$. After adding rows $1$ to $d_2 - 1$ to the
last row with index $d_2$ and columns $1$ to $d_4 - 1$ to column $d_4$, we arrive at the matrix
$Q^{(i_1, i_3)}$. These row and column operations preserve rank and thus the claim follows. □

In order to make our point about non-smoothness of type II (AMP) models, we consider 
four binary variables, that is, d\ = d 2 = d 3 = d 4 = 2. In this case, there are twelve Mobius 
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parameters in (5.1) and (5.2), and the rank constraints in Proposition 14 require the 
vanishing of four 2 x 2-dctcrminants. For more compact notation, let 

<? Q =<7a(l Q ), aC{2,4} 

and 

q a \i=Qa(l- a \i), {3}CaC {2,3,4}. 
Then the two determinants for 13 = 1 yield the equations 

?23|i<?34|i = <23|iQ234|i, « = 1, 2, (5.6) 

whereas the two determinants for i% = da = 2 yield 

<33K<?24 — <?23|i<?4 — 9341^2 + <?234|i = <?24 — g2Q r 4, 1=1,2. (5.7) 

We remark that the equations in (5.6) are an instance of the factorization in undirected
graphical models [21]: the singleton {3} is a separator of the two cliques {2,3} and {3,4}
in the undirected induced subgraph $G_{\{2,3,4\}}$.
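The link between the two determinants and the equations (5.6) and (5.7) can be checked numerically. The following Python sketch is only an illustration and is not the computation carried out in the paper. It assumes, as the notation $q_\alpha(1_\alpha)$ suggests, that each Möbius parameter is the probability that all variables in $\alpha$ take level 1; the definitions (5.1) and (5.2) are not restated here, so this reading is an assumption. The sketch works with a single table for $(X_2, X_3, X_4)$ at one fixed level of $X_1$, so $q_2$, $q_4$ and $q_{24}$ are computed from that same table. The helper names `det2`, `q` and `residuals` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def q(p, alpha):
    """Assumed Moebius parameter: P(X_v = 1 for all v in alpha), alpha within {2, 3, 4}.
    Table keys are (x2, x3, x4), so variable v sits at position v - 2."""
    return sum(val for key, val in p.items()
               if all(key[v - 2] == 1 for v in alpha))

def residuals(p):
    """Return (det for i3 = 1, det for i3 = 2, lhs - rhs of (5.6), lhs - rhs of (5.7))."""
    q2, q3, q4 = q(p, {2}), q(p, {3}), q(p, {4})
    q23, q24, q34 = q(p, {2, 3}), q(p, {2, 4}), q(p, {3, 4})
    q234 = q(p, {2, 3, 4})
    # Rows index x2, columns index x4, one matrix per level k of x3.
    dets = [det2([[p[(j, k, l)] for l in (1, 2)] for j in (1, 2)]) for k in (1, 2)]
    r56 = q23 * q34 - q3 * q234
    r57 = (q3 * q24 - q23 * q4 - q34 * q2 + q234) - (q24 - q2 * q4)
    return dets[0], dets[1], r56, r57

keys = list(product((1, 2), repeat=3))

# An arbitrary positive table (weights 1..8, normalised), with no independence imposed.
generic = {key: F(w, 36) for w, key in enumerate(keys, start=1)}

# A table with X2 independent of X4 given X3, built as s(k) * t(j | k) * u(l | k).
s = {1: F(1, 3), 2: F(2, 3)}
t = {1: F(1, 4), 2: F(3, 5)}   # P(X2 = 1 | X3 = k)
u = {1: F(1, 2), 2: F(2, 7)}   # P(X4 = 1 | X3 = k)
ci_table = {(j, k, l): s[k] * (t[k] if j == 1 else 1 - t[k])
                            * (u[k] if l == 1 else 1 - u[k])
            for (j, k, l) in keys}

print(residuals(generic))
print(residuals(ci_table))
```

Under the stated assumption, the determinant for $i_3 = 1$ equals the negative residual of (5.6) for every table, and the determinant for $i_3 = 2$ equals the negative sum of the two residuals. Both determinants therefore vanish exactly when (5.6) and (5.7) hold, which is what the second table illustrates.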

The four equations in (5.6) and (5.7) define an eight-dimensional algebraic set in $\mathbb{R}^{11}$;
we omit the irrelevant Möbius parameter $q_1$. Using the software Singular [15] we can
compute the singularities of this set. (See [3] for a definition of singularities.) We find
that the singular locus is determined by the equations

$$
q_2 q_4 = q_{24}, \quad q_{3|i}\, q_2 = q_{23|i}, \quad q_{3|i}\, q_4 = q_{34|i}, \quad q_2\, q_{3|i}\, q_4 = q_{234|i}, \quad i = 1,2, \tag{5.8}
$$

which by an appeal to Theorem 8 implies the following fact:

Proposition 15. Let $G$ be the graph in Figure 3, and $G_{\mathrm{sing}}$ the subgraph that has the
edges $2 - 3$ and $3 - 4$ deleted. If all variables are binary ($d_i = 2$), then the singular locus
of $\mathbf{P}_{\mathrm{II}}(G)$ is equal to the submodel $\mathbf{P}_{\mathrm{II}}(G_{\mathrm{sing}}) = \mathbf{P}_{\mathrm{IV}}(G_{\mathrm{sing}})$.

An example of a statistical consequence of the non-smoothness of $\mathbf{P}_{\mathrm{II}}(G)$ is that a
$\chi^2$-approximation is inappropriate for the likelihood ratio test of $\mathbf{P}_{\mathrm{II}}(G_{\mathrm{sing}})$ versus
$\mathbf{P}_{\mathrm{II}}(G)$; compare [6].

Remark 16. We conjecture that $\mathbf{P}_{\mathrm{II}}(G)$ is non-smooth regardless of the number of
levels $d_i$ for the four random variables. Using Singular, we were able to verify the claim
of Proposition 15 when $X_3$ is ternary, that is, $d_1 = d_2 = d_4 = 2$ and $d_3 = 3$. Moreover,
we could compute the case $d_2 = d_3 = d_4 = 2$ and $d_1 = 3$, for which $\mathbf{P}_{\mathrm{II}}(G_{\mathrm{sing}})$ is only a
proper subset of the singular locus.

6. A non-smooth model of type III

Let $G$ be the chain graph in Figure 3. As seen in the previous section, the binary type
II model $\mathbf{P}_{\mathrm{II}}(G)$ is non-smooth. Nevertheless, one can give a rational parametrization
of $\mathbf{P}_{\mathrm{II}}(G)$ by solving the equations (5.6) and (5.7). In this section we show that the
binary type III model $\mathbf{P}_{\mathrm{III}}(G)$ is the union of two strict submodels defined by polynomial
equations, which implies that the model is non-smooth and cannot be parametrized using
a rational map. The non-existence of a rational parametrization follows from the fact
that an algebraic set with rational parametrization is irreducible [3], Section 4.5. An
algebraic set, that is, a set defined by polynomial equations, is irreducible if it cannot be
decomposed into a finite union of strict algebraic subsets.

Let $X_1, \dots, X_4$ be binary variables in correspondence with the nodes of $G$. As stated
in Example 3, $\mathbf{P}_{\mathrm{III}}(G)$ comprises the positive distributions for which

$$
2 \perp\!\!\!\perp 4 \mid 1 \quad \text{and} \quad 1 \perp\!\!\!\perp \{2,4\} \mid 3. \tag{6.1}
$$

Define the two $2 \times 2$-matrices

$$
P^{(i)}_{2 \perp\!\!\!\perp 4 \mid 1} =
\begin{pmatrix}
p_{i1+1} & p_{i1+2} \\
p_{i2+1} & p_{i2+2}
\end{pmatrix},
\quad i = 1,2, \tag{6.2}
$$

and the two $2 \times 4$-matrices

$$
P^{(k)}_{1 \perp\!\!\!\perp \{2,4\} \mid 3} =
\begin{pmatrix}
p_{11k1} & p_{11k2} & p_{12k1} & p_{12k2} \\
p_{21k1} & p_{21k2} & p_{22k1} & p_{22k2}
\end{pmatrix},
\quad k = 1,2. \tag{6.3}
$$

Here, $p_{ijk\ell} = p(i,j,k,\ell)$ and $p_{ij+\ell} = P(X_1 = i, X_2 = j, X_4 = \ell)$ for $i, j, \ell = 1,2$.

A probability distribution on $[2]^4$ satisfies the two conditional independences in (6.1)
if and only if the four matrices in (6.2) and (6.3) have a rank of one at most. This rank
condition together with the constraint that the probabilities $p_{ijk\ell}$ sum to one defines
a seven-dimensional algebraic set in $\mathbb{R}^{16}$. By computing a primary decomposition using
Singular, this set is seen to decompose into the union of two strict algebraic subsets that
both have dimension seven. (See again [3] for an introduction to the involved algebraic
concepts.)

Proposition 17. Let $G$ be the graph in Figure 3. If all variables are binary ($d_i = 2$),
then a positive probability vector $p = (p_{ijk\ell})$ is in $\mathbf{P}_{\mathrm{III}}(G)$ if and only if at least one of
the following two conditions is met:

(i) $1 \perp\!\!\!\perp \{2,3,4\}$ and $2 \perp\!\!\!\perp \{1,4\}$, or

(ii) $2 \perp\!\!\!\perp 4 \mid 1$ and $1 \perp\!\!\!\perp \{2,4\} \mid 3$ and

$$
p_{1121}\, p_{2222} - p_{1122}\, p_{2221} = p_{1111}\, p_{2212} - p_{1112}\, p_{2211}, \tag{6.4}
$$

$$
p_{1221}\, p_{2122} - p_{1222}\, p_{2121} = p_{1211}\, p_{2112} - p_{1212}\, p_{2111}. \tag{6.5}
$$

The equations (6.4) and (6.5) are equalities of $2 \times 2$-minors of the two matrices

$$
P^{(k)}_{\{1,2\} \perp\!\!\!\perp 4 \mid 3} =
\begin{pmatrix}
p_{11k1} & p_{12k1} & p_{21k1} & p_{22k1} \\
p_{11k2} & p_{12k2} & p_{21k2} & p_{22k2}
\end{pmatrix},
\quad k = 1,2, \tag{6.6}
$$

associated with the conditional independence $\{1,2\} \perp\!\!\!\perp 4 \mid 3$. Hence, for these equations to
hold, some homogeneity is required between the possible conditional dependence between $(X_1, X_2)$ and
$X_4$ given $X_3 = 1$ and the possible conditional dependence between $(X_1, X_2)$ and $X_4$ given
$X_3 = 2$.
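The rank conditions on (6.2) and (6.3) and the minor equalities (6.4) and (6.5) are simple to evaluate for a given table of probabilities. The following Python sketch is an illustration under stated assumptions and does not reproduce the primary decomposition computed with Singular. Tables are dictionaries keyed by $(i,j,k,\ell)$ with exact `Fraction` entries, and the function names and the two test distributions are hypothetical choices made here.

```python
from fractions import Fraction as F
from itertools import combinations, product

def rank_at_most_one(m):
    """A two-row matrix has rank <= 1 iff all of its 2 x 2 minors vanish."""
    return all(m[0][a] * m[1][b] == m[0][b] * m[1][a]
               for a, b in combinations(range(len(m[0])), 2))

def matrix_62(p, i):
    """Matrix (6.2): rows j, columns l, entries p_{ij+l} (X3 summed out)."""
    return [[p[i, j, 1, l] + p[i, j, 2, l] for l in (1, 2)] for j in (1, 2)]

def matrix_63(p, k):
    """Matrix (6.3): rows i, columns (j, l), entries p_{ijkl}."""
    return [[p[i, j, k, l] for j in (1, 2) for l in (1, 2)] for i in (1, 2)]

def satisfies_61(p):
    """Both independences in (6.1), tested through the four rank conditions."""
    return (all(rank_at_most_one(matrix_62(p, i)) for i in (1, 2))
            and all(rank_at_most_one(matrix_63(p, k)) for k in (1, 2)))

def minor_66(p, k, col_a, col_b):
    """2 x 2 minor of matrix (6.6) at level k; columns are (i, j) pairs, rows are l."""
    (i1, j1), (i2, j2) = col_a, col_b
    return p[i1, j1, k, 1] * p[i2, j2, k, 2] - p[i1, j1, k, 2] * p[i2, j2, k, 1]

def satisfies_64_65(p):
    """Equations (6.4) and (6.5): matching minors for X3 = 2 and X3 = 1."""
    eq64 = minor_66(p, 2, (1, 1), (2, 2)) == minor_66(p, 1, (1, 1), (2, 2))
    eq65 = minor_66(p, 2, (1, 2), (2, 1)) == minor_66(p, 1, (1, 2), (2, 1))
    return eq64 and eq65

cells = list(product((1, 2), repeat=4))

# Arbitrary positive table (weights 1..16, normalised): no independence imposed.
generic = {cell: F(w, 136) for w, cell in enumerate(cells, start=1)}

# Hypothetical structured table: (X1, X3) dependent, X2 and X4 independent of everything.
c = {(1, 1): F(2, 5), (1, 2): F(1, 10), (2, 1): F(1, 10), (2, 2): F(2, 5)}
m = {1: F(1, 3), 2: F(2, 3)}
n = {1: F(1, 4), 2: F(3, 4)}
structured = {(i, j, k, l): c[i, k] * m[j] * n[l] for (i, j, k, l) in cells}

print(satisfies_61(generic), satisfies_61(structured), satisfies_64_65(structured))
```

For the structured table every minor of (6.6) vanishes, so (6.4) and (6.5) hold trivially as $0 = 0$; this corresponds to the case where $\{1,2\} \perp\!\!\!\perp 4 \mid 3$ holds outright.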

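In order to show that the two components of $\mathbf{P}_{\mathrm{III}}(G)$ are indeed non-trivial and distinct
from each other over the interior of the probability simplex, we give the following two
examples. The probability vector

$$
\begin{pmatrix}
p_{1111} & p_{1112} & p_{1121} & p_{1122} \\
p_{1211} & p_{1212} & p_{1221} & p_{1222} \\
p_{2111} & p_{2112} & p_{2121} & p_{2122} \\
p_{2211} & p_{2212} & p_{2221} & p_{2222}
\end{pmatrix}
= [\text{numerical entries missing}]
\tag{6.7}
$$

satisfies the condition in Proposition 17(i) but not the one in (ii), whereas

$$
\begin{pmatrix}
p_{1111} & p_{1112} & p_{1121} & p_{1122} \\
p_{1211} & p_{1212} & p_{1221} & p_{1222} \\
p_{2111} & p_{2112} & p_{2121} & p_{2122} \\
p_{2211} & p_{2212} & p_{2221} & p_{2222}
\end{pmatrix}
= [\text{numerical entries missing}]
\tag{6.8}
$$

satisfies (ii) but not (i).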
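That the two conditions of Proposition 17 are genuinely different can also be illustrated with freshly constructed distributions. The two tables in the following Python sketch are hypothetical examples built here for illustration; they are not the entries of (6.7) and (6.8). The helper `indep` is likewise an assumption of this sketch: it tests a conditional independence for a positive table by checking that all $2 \times 2$ cross-product differences vanish.

```python
from fractions import Fraction as F
from itertools import product

CELLS = list(product((1, 2), repeat=4))  # cells are (i, j, k, l) for X1, ..., X4

def marginal(p, variables):
    """Marginal table over the given 1-based variable indices."""
    out = {}
    for cell, val in p.items():
        key = tuple(cell[v - 1] for v in variables)
        out[key] = out.get(key, 0) + val
    return out

def indep(p, A, B, C=()):
    """Check X_A independent of X_B given X_C for a positive binary table."""
    A, B, C = list(A), list(B), list(C)
    m = marginal(p, A + B + C)
    levels = lambda size: list(product((1, 2), repeat=size))
    for c in levels(len(C)):
        for a1, a2 in product(levels(len(A)), repeat=2):
            for b1, b2 in product(levels(len(B)), repeat=2):
                if m[a1 + b1 + c] * m[a2 + b2 + c] != m[a1 + b2 + c] * m[a2 + b1 + c]:
                    return False
    return True

def cond_i(p):
    """Condition (i) of Proposition 17."""
    return indep(p, [1], [2, 3, 4]) and indep(p, [2], [1, 4])

def cond_ii(p):
    """Condition (ii) of Proposition 17, including (6.4) and (6.5)."""
    eq64 = (p[1, 1, 2, 1] * p[2, 2, 2, 2] - p[1, 1, 2, 2] * p[2, 2, 2, 1]
            == p[1, 1, 1, 1] * p[2, 2, 1, 2] - p[1, 1, 1, 2] * p[2, 2, 1, 1])
    eq65 = (p[1, 2, 2, 1] * p[2, 1, 2, 2] - p[1, 2, 2, 2] * p[2, 1, 2, 1]
            == p[1, 2, 1, 1] * p[2, 1, 1, 2] - p[1, 2, 1, 2] * p[2, 1, 1, 1])
    return indep(p, [2], [4], [1]) and indep(p, [1], [2, 4], [3]) and eq64 and eq65

# Hypothetical example for (i): X1 independent of the rest, X2 and X4 marginally
# independent, and X3 depending on (X2, X4) through r1 = P(X3 = 1 | X2 = j, X4 = l).
a = {1: F(1, 3), 2: F(2, 3)}
m = {1: F(1, 2), 2: F(1, 2)}
n = {1: F(1, 3), 2: F(2, 3)}
r1 = {(1, 1): F(1, 2), (1, 2): F(1, 4), (2, 1): F(1, 4), (2, 2): F(1, 2)}
only_i = {(i, j, k, l): a[i] * m[j] * n[l] * (r1[j, l] if k == 1 else 1 - r1[j, l])
          for (i, j, k, l) in CELLS}

# Hypothetical example for (ii): X1 and X3 dependent, X2 and X4 independent of everything.
c = {(1, 1): F(2, 5), (1, 2): F(1, 10), (2, 1): F(1, 10), (2, 2): F(2, 5)}
only_ii = {(i, j, k, l): c[i, k] * m[j] * n[l] for (i, j, k, l) in CELLS}

print(cond_i(only_i), cond_ii(only_i))
print(cond_i(only_ii), cond_ii(only_ii))
```

In the first table both independences of (6.1) hold, but (6.4) fails because the conditional distribution of $X_3$ given $(X_2, X_4)$ was chosen so that the two minors differ. In the second table $X_1$ and $X_3$ are dependent, which rules out condition (i).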



Remark 18. More complicated computations in Singular are feasible. If $X_3$ is ternary
($d_1 = d_2 = d_4 = 2$ and $d_3 = 3$), then the algebraic set corresponding to the model $\mathbf{P}_{\mathrm{III}}(G)$
breaks into six components. If $X_1$ is the only ternary variable ($d_2 = d_3 = d_4 = 2$ and
$d_1 = 3$), then the set is irreducible. Nevertheless, we conjecture that $\mathbf{P}_{\mathrm{III}}(G)$ is
non-smooth regardless of the choice of $d_1, \dots, d_4$.



7. Discussion 

The main contribution of this paper concerns discrete chain graph models of type IV. 
These models are related to multivariate regression and can also be derived from a path 
diagram interpretation for the chain graph. In fact, the block-recursive Markov property 
of type IV can be shown to be equivalent to the global Markov property discussed, for 
instance, in [20, 25, 26]. In this paper we showed that, just like their Gaussian analogs, 
type IV models are curved exponential families (Corollary 10). This brings with it all the 
convenience of the standard asymptotic theory for likelihood inference. 

Practical use of the discrete models requires algorithms for maximization of the
likelihood function, and at least two approaches are possible. On one hand, one may write
the likelihood function as a function of the Möbius parameters from (3.7) and then apply
general optimizers. On the other hand, the iterative conditional fitting algorithm of [8]
can be extended to chain graphs, as explained in [5].

The approach we took when studying type IV models is also useful for analyzing
discrete models of type II (AMP). While much of the related work on models for categorical
data employs log-linear expansions (see, e.g., [2, 19, 23]), the change of coordinates in
Theorem 8 remains at the level of conditional probabilities. This preserves the algebraic
structure that was useful for showing that the type II class includes models with
singularities (Section 5). An interesting problem for future research would be to characterize
all chain graphs that yield smooth discrete (or perhaps more concretely, binary) chain
graph models of type II. In particular, an interesting question is whether there exist chain
graphs $G$ for which $\mathbf{P}_{\mathrm{II}}(G)$ is smooth and such that there does not exist a chain graph
$G'$ with $\mathbf{P}_{\mathrm{II}}(G) = \mathbf{P}_{\mathrm{I}}(G')$ or $\mathbf{P}_{\mathrm{II}}(G) = \mathbf{P}_{\mathrm{IV}}(G')$. Similar questions arise for models of type
III, which may also be non-smooth (Section 6).

Finally, we recall Remark 5, where we commented on possible pairwise Markov
interpretations of chain graphs. While these can be awkward in the sense that the focus may
be on distributions for which $v \perp\!\!\!\perp u$, $v \perp\!\!\!\perp w$ but not $v \perp\!\!\!\perp \{u,w\}$, it is interesting that the
pairwise type II interpretation of the graph in Figure 3 yields a smooth model.